{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\" />\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
    "\n",
    "Author: [Egor Polusmak](https://www.linkedin.com/in/egor-polusmak/). Translated and edited by Alena Sharlo, [Yury Kashnitsky](https://yorko.github.io), [Artem Trunov](https://www.linkedin.com/in/datamove), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>Topic 2. Visual data analysis in Python\n",
    "## <center>Part 2. Overview of Seaborn, Matplotlib and Plotly libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Article outline\n",
    "\n",
    "1. [Dataset](#1.-Dataset)\n",
    "2. [DataFrame.plot()](#2.-DataFrame.plot)\n",
    "3. [Seaborn](#3.-Seaborn)\n",
    "4. [Plotly](#4.-Plotly)\n",
    "5. [Demo assignment](#5.-Demo-assignment)\n",
    "6. [Useful resources](#6.-Useful-resources)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Dataset\n",
    "\n",
    "First, we will set up our environment by importing all necessary libraries. We will also change the display settings to better show plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Matplotlib forms basis for visualization in Python\n",
    "import matplotlib.pyplot as plt\n",
    "# We will use the Seaborn library\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set()\n",
    "\n",
    "# Graphics in SVG format are more sharp and legible\n",
    "%config InlineBackend.figure_format = 'svg'\n",
    "\n",
    "# Increase the default plot size and set the color scheme\n",
    "plt.rcParams[\"figure.figsize\"] = (8, 5)\n",
    "plt.rcParams[\"image.cmap\"] = \"viridis\"\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let’s load the dataset that we will be using into a `DataFrame`. I have picked a dataset on video game sales and ratings from [Kaggle Datasets](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings).\n",
    "Some of the games in this dataset lack ratings; so, let’s filter for only those examples that have all of their values present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"../../data/video_games_sales.csv\").dropna()\n",
    "print(df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, print the summary of the `DataFrame` to check data types and to verify everything is non-null."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that `pandas` has loaded some of the numerical features as `object` type. We will explicitly convert those columns into `float` and `int`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"User_Score\"] = df[\"User_Score\"].astype(\"float64\")\n",
    "df[\"Year_of_Release\"] = df[\"Year_of_Release\"].astype(\"int64\")\n",
    "df[\"User_Count\"] = df[\"User_Count\"].astype(\"int64\")\n",
    "df[\"Critic_Count\"] = df[\"Critic_Count\"].astype(\"int64\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting `DataFrame` contains 6825 examples and 16 columns. Let’s look at the first few entries with the `head()` method to check that everything has been parsed correctly. To make it more convenient, I have listed only the variables that we will use in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "useful_cols = [\n",
    "    \"Name\",\n",
    "    \"Platform\",\n",
    "    \"Year_of_Release\",\n",
    "    \"Genre\",\n",
    "    \"Global_Sales\",\n",
    "    \"Critic_Score\",\n",
    "    \"Critic_Count\",\n",
    "    \"User_Score\",\n",
    "    \"User_Count\",\n",
    "    \"Rating\",\n",
    "]\n",
    "df[useful_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. DataFrame.plot()\n",
    "\n",
    "Before we turn to Seaborn and Plotly, let’s discuss the simplest and often most convenient way to visualize data from a `DataFrame`: using its own `plot()` method.\n",
    "\n",
    "As an example, we will create a plot of video game sales by country and year. First, let’s keep only the columns we need. Then, we will calculate the total sales by year and call the `plot()` method on the resulting `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[x for x in df.columns if \"Sales\" in x] + [\"Year_of_Release\"]].groupby(\n",
    "    \"Year_of_Release\"\n",
    ").sum().plot();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the implementation of the `plot()` method in `pandas` is based on `matplotlib`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the `kind` parameter, you can change the type of the plot to, for example, a *bar chart*. `matplotlib` is generally quite flexible for customizing plots. You can change almost everything in the chart, but you may need to dig into the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) to find the corresponding parameters. For example, the parameter `rot` is responsible for the rotation angle of ticks on the x-axis (for vertical plots):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[[x for x in df.columns if \"Sales\" in x] + [\"Year_of_Release\"]].groupby(\n",
    "    \"Year_of_Release\"\n",
    ").sum().plot(kind=\"bar\", rot=45);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Seaborn\n",
    "\n",
    "Now, let's move on to the `Seaborn` library. `seaborn` is essentially a higher-level API based on the `matplotlib` library. Among other things, it differs from the latter in that it contains more adequate default settings for plotting. By adding `import seaborn as sns; sns.set()` in your code, the images of your plots will become much nicer. Also, this library contains a set of complex tools for visualization that would otherwise (i.e. when using bare `matplotlib`) require quite a large amount of code.\n",
    "\n",
    "#### pairplot()\n",
    "\n",
    "Let's take a look at the first of such complex plots, a *pairwise relationships plot*, which creates a matrix of scatter plots by default. This kind of plot helps us visualize the relationship between different variables in a single output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# `pairplot()` may become very slow with the SVG format\n",
    "%config InlineBackend.figure_format = 'png'\n",
    "sns.pairplot(\n",
    "    df[[\"Global_Sales\", \"Critic_Score\", \"Critic_Count\", \"User_Score\", \"User_Count\"]]\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the distribution histograms lie on the diagonal of the matrix. The remaining charts are scatter plots for the corresponding pairs of features.\n",
    "\n",
    "#### distplot()\n",
    "\n",
    "It is also possible to plot a distribution of observations with `seaborn`'s `distplot()`. For example, let's look at the distribution of critics' ratings: `Critic_Score`. By default, the plot displays a histogram and the [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%config InlineBackend.figure_format = 'svg'\n",
    "sns.distplot(df[\"Critic_Score\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### jointplot()\n",
    "\n",
    "To look more closely at the relationship between two numerical variables, you can use *joint plot*, which is a cross between a scatter plot and histogram. Let's see how the `Critic_Score` and `User_Score` features are related."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"Critic_Score\", y=\"User_Score\", data=df, kind=\"scatter\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### boxplot()\n",
    "\n",
    "Another useful type of plot is a *box plot*. Let's compare critics' ratings for the top 5 biggest gaming platforms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_platforms = (\n",
    "    df[\"Platform\"].value_counts().sort_values(ascending=False).head(5).index.values\n",
    ")\n",
    "sns.boxplot(\n",
    "    y=\"Platform\",\n",
    "    x=\"Critic_Score\",\n",
    "    data=df[df[\"Platform\"].isin(top_platforms)],\n",
    "    orient=\"h\",\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is worth spending a bit more time to discuss how to interpret a box plot. Its components are a *box* (obviously, this is why it is called a *box plot*), the so-called *whiskers*, and a number of individual points (*outliers*).\n",
    "\n",
    "The box by itself illustrates the interquartile spread of the distribution; its length determined by the $25\\% \\, (\\text{Q1})$ and $75\\% \\, (\\text{Q3})$ percentiles. The vertical line inside the box marks the median ($50\\%$) of the distribution. \n",
    "\n",
    "The whiskers are the lines extending from the box. They represent the entire scatter of data points, specifically the points that fall within the interval $(\\text{Q1} - 1.5 \\cdot \\text{IQR}, \\text{Q3} + 1.5 \\cdot \\text{IQR})$, where $\\text{IQR} = \\text{Q3} - \\text{Q1}$ is the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range).\n",
    "\n",
    "Outliers that fall out of the range bounded by the whiskers are plotted individually.\n",
    "\n",
    "#### heatmap()\n",
    "\n",
    "The last type of plot that we will cover here is a *heat map*. A heat map allows you to view the distribution of a numerical variable over two categorical ones. Let’s visualize the total sales of games by genre and gaming platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "platform_genre_sales = (\n",
    "    df.pivot_table(\n",
    "        index=\"Platform\", columns=\"Genre\", values=\"Global_Sales\", aggfunc=sum\n",
    "    )\n",
    "    .fillna(0)\n",
    "    .applymap(float)\n",
    ")\n",
    "sns.heatmap(platform_genre_sales, annot=True, fmt=\".1f\", linewidths=0.5);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Plotly\n",
    "\n",
    "We have examined some visualization tools based on the `matplotlib` library. However, this is not the only option for plotting in `Python`. Let’s take a look at the `plotly` library. Plotly is an open-source library that allows creation of interactive plots within a Jupyter notebook without having to use Javascript.\n",
    "\n",
    "The real beauty of interactive plots is that they provide a user interface for detailed data exploration. For example, you can see exact numerical values by mousing over points, hide uninteresting series from the visualization, zoom in onto a specific part of the plot, etc.\n",
    "\n",
    "Before we start, let’s import all the necessary modules and initialize `plotly` by calling the `init_notebook_mode()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import plotly\n",
    "import plotly.graph_objs as go\n",
    "from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot\n",
    "\n",
    "init_notebook_mode(connected=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Line plot\n",
    "\n",
    "First of all, let’s build a *line plot* showing the number of games released and their sales by year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "years_df = (\n",
    "    df.groupby(\"Year_of_Release\")[[\"Global_Sales\"]]\n",
    "    .sum()\n",
    "    .join(df.groupby(\"Year_of_Release\")[[\"Name\"]].count())\n",
    ")\n",
    "years_df.columns = [\"Global_Sales\", \"Number_of_Games\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Figure` is the main class and a work horse of visualization in `plotly`. It consists of the data (an array of lines called `traces` in this library) and the style (represented by the `layout` object). In the simplest case, you may call the `iplot` function to return only `traces`.\n",
    "\n",
    "The `show_link` parameter toggles the visibility of the links leading to the online platform `plot.ly` in your charts. Most of the time, this functionality is not needed, so you may want to turn it off by passing `show_link=False` to prevent accidental clicks on those links."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a line (trace) for the global sales\n",
    "trace0 = go.Scatter(x=years_df.index, y=years_df[\"Global_Sales\"], name=\"Global Sales\")\n",
    "\n",
    "# Create a line (trace) for the number of games released\n",
    "trace1 = go.Scatter(\n",
    "    x=years_df.index, y=years_df[\"Number_of_Games\"], name=\"Number of games released\"\n",
    ")\n",
    "\n",
    "# Define the data array\n",
    "data = [trace0, trace1]\n",
    "\n",
    "# Set the title\n",
    "layout = {\"title\": \"Statistics for video games\"}\n",
    "\n",
    "# Create a Figure and plot it\n",
    "fig = go.Figure(data=data, layout=layout)\n",
    "iplot(fig, show_link=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an option, you can save the plot in an html file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plotly.offline.plot(fig, filename=\"years_stats.html\", show_link=False);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Bar chart\n",
    "\n",
    "Let's use a *bar chart* to compare the market share of different gaming platforms broken down by the number of new releases and by total revenue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Do calculations and prepare the dataset\n",
    "platforms_df = (\n",
    "    df.groupby(\"Platform\")[[\"Global_Sales\"]]\n",
    "    .sum()\n",
    "    .join(df.groupby(\"Platform\")[[\"Name\"]].count())\n",
    ")\n",
    "platforms_df.columns = [\"Global_Sales\", \"Number_of_Games\"]\n",
    "platforms_df.sort_values(\"Global_Sales\", ascending=False, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a bar for the global sales\n",
    "trace0 = go.Bar(\n",
    "    x=platforms_df.index, y=platforms_df[\"Global_Sales\"], name=\"Global Sales\"\n",
    ")\n",
    "\n",
    "# Create a bar for the number of games released\n",
    "trace1 = go.Bar(\n",
    "    x=platforms_df.index,\n",
    "    y=platforms_df[\"Number_of_Games\"],\n",
    "    name=\"Number of games released\",\n",
    ")\n",
    "\n",
    "# Get together the data and style objects\n",
    "data = [trace0, trace1]\n",
    "layout = {\"title\": \"Market share by gaming platform\"}\n",
    "\n",
    "# Create a `Figure` and plot it\n",
    "fig = go.Figure(data=data, layout=layout)\n",
    "iplot(fig, show_link=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Box plot\n",
    "\n",
    "`plotly` also supports *box plots*. Let’s consider the distribution of critics' ratings by the genre of the game."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = []\n",
    "\n",
    "# Create a box trace for each genre in our dataset\n",
    "for genre in df.Genre.unique():\n",
    "    data.append(go.Box(y=df[df.Genre == genre].Critic_Score, name=genre))\n",
    "\n",
    "# Visualize\n",
    "iplot(data, show_link=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using `plotly`, you can also create other types of visualization. Even with default settings, the plots look quite nice. Additionally, the library makes it easy to modify various parameters: colors, fonts, captions, annotations, and so on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Demo assignment\n",
    "To practice with visual data analysis, you can complete [this assignment](https://www.kaggle.com/kashnitsky/a2-demo-analyzing-cardiovascular-data) where you'll be analyzing cardiovascular disease data.\n",
    "\n",
    "## 6. Useful resources\n",
    "- The same notebook as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-2-part-2-seaborn-and-plotly)\n",
    "- [\"Plotly for interactive plots\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/plotly_tutorial_for_interactive_plots_sankovalev.ipynb) - a tutorial by Alexander Kovalev within mlcourse.ai (full list of tutorials is [here](https://mlcourse.ai/tutorials))\n",
    "- [\"Bring your plots to life with Matplotlib animations\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/bring_your_plots_to_life_with_matplotlib_animations_kyriacos_kyriacou.ipynb) - a tutorial by Kyriacos Kyriacou within mlcourse.ai\n",
    "- [\"Some details on Matplotlib\"](https://nbviewer.jupyter.org/github/Yorko/mlcourse.ai/blob/master/jupyter_english/tutorials/some_details_in_matplotlib_pisarev_ivan.ipynb) - a tutorial by Ivan Pisarev within mlcourse.ai\n",
    "- Main course [site](https://mlcourse.ai), [course repo](https://github.com/Yorko/mlcourse.ai), and YouTube [channel](https://www.youtube.com/watch?v=QKTuw4PNOsU&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX)\n",
    "- Medium [\"story\"](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-2-visual-data-analysis-in-python-846b989675cd) based on this notebook\n",
    "- Course materials as a [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse)\n",
    "- If you read Russian: an [article](https://habrahabr.ru/company/ods/blog/323210/) on Habrahabr with ~ the same material. And a [lecture](https://youtu.be/vm63p8Od0bM) on YouTube\n",
    "- Here is the official documentation for the libraries we used: [`matplotlib`](https://matplotlib.org/contents.html), [`seaborn`](https://seaborn.pydata.org/introduction.html) and [`pandas`](https://pandas.pydata.org/pandas-docs/stable/).\n",
    "- The [gallery](http://seaborn.pydata.org/examples/index.html) of sample charts created with `seaborn` is a very good resource.\n",
    "- Also, see the [documentation](http://scikit-learn.org/stable/modules/manifold.html) on Manifold Learning in `scikit-learn`.\n",
    "- Efficient t-SNE implementation [Multicore-TSNE](https://github.com/DmitryUlyanov/Multicore-TSNE).\n",
    "- \"How to Use t-SNE Effectively\", [Distill.pub](https://distill.pub/2016/misread-tsne/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
